A Closer Look at Skip-gram Modelling
Authors
Abstract
Data sparsity is a major problem in natural language processing: language is a system of rare events, so varied and complex that even with an extremely large corpus we can never accurately model all possible strings of words. This paper examines the use of skip-grams (a technique whereby n-grams are still stored to model language, but tokens within them are allowed to be skipped) to overcome the data sparsity problem. We analyze this by computing all possible skip-grams in a training corpus and measuring how many adjacent (standard) n-grams they cover in test documents. We examine skip-gram modelling using one to four skips with various amounts of training data, testing against similar documents as well as documents generated by a machine translation system. We also determine the amount of extra training data that standard adjacent tri-grams would require to match skip-gram coverage.
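To make the technique concrete, here is a minimal sketch in Python of k-skip-n-gram extraction and the coverage measurement the abstract describes. It assumes whitespace-tokenized text; the function names (skip_grams, ngrams, coverage) and the toy corpus are illustrative, not taken from the paper.

```python
# A minimal sketch of k-skip-n-gram extraction and coverage measurement.
# Names and the toy data below are illustrative assumptions, not the
# paper's actual implementation.
from itertools import combinations

def skip_grams(tokens, n=3, k=2):
    """Return the set of all n-grams with up to k skipped tokens in total.

    Each skip-gram is anchored at the first token of a window of
    n + k consecutive tokens; the remaining n - 1 tokens are chosen
    from the rest of the window in order.
    """
    grams = set()
    for start in range(len(tokens) - n + 1):
        window = tokens[start : start + n + k]
        # The first token is fixed; choose the remaining n-1 from the rest.
        for rest in combinations(range(1, len(window)), n - 1):
            grams.add((window[0],) + tuple(window[i] for i in rest))
    return grams

def ngrams(tokens, n=3):
    """Standard adjacent n-grams."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

def coverage(train_tokens, test_tokens, n=3, k=2):
    """Fraction of the test text's adjacent n-grams that also occur
    among the training text's k-skip-n-grams."""
    model = skip_grams(train_tokens, n, k)
    test = ngrams(test_tokens, n)
    return sum(1 for g in test if g in model) / len(test) if test else 0.0

if __name__ == "__main__":
    train = "we can never accurately model all possible strings of words".split()
    test = "we can accurately model strings of words".split()
    print(f"tri-gram coverage with 2 skips: {coverage(train, test, n=3, k=2):.2f}")
```

Anchoring each window at its first token is sufficient because any n-gram with at most k total skips spans at most n + k consecutive positions starting from its first token; in the toy example the two skips let the training sentence cover every adjacent tri-gram of the test sentence.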
Related resources
I'm No Longer a Child: A Closer Look at the Interaction Between Iranian EFL University Students' Identities and Their Academic Performance
Although university EFL students represent a wide array of social and cultural identities, their multiple and diverse identities are not usually considered in foreign language classrooms. This qualitative case study examined the identity conflicts experienced by Iranian EFL learners in a university context. To this end, two Shiraz University students' identities were investigated. Sem...
A Multitask Objective to Inject Lexical Contrast into Distributional Semantics
Distributional semantic models have trouble distinguishing strongly contrasting words (such as antonyms) from highly compatible ones (such as synonyms), because both kinds tend to occur in similar contexts in corpora. We introduce the multitask Lexical Contrast Model (mLCM), an extension of the effective Skip-gram method that optimizes semantic vectors on the joint tasks of predicting corpus co...
Explaining and Generalizing Skip-Gram through Exponential Family Principal Component Analysis
The popular skip-gram model induces word embeddings by exploiting the signal from word-context co-occurrence. We offer a new interpretation of skip-gram based on exponential family PCA, a form of matrix factorization. This makes it clear that we can extend the skip-gram method to tensor factorization, in order to train embeddings through richer higher-order co-occurrences, e.g., triples that include...
word2vec Skip-Gram with Negative Sampling is a Weighted Logistic PCA
Mikolov et al. (2013) introduced the skip-gram formulation for neural word embeddings, wherein one tries to predict the context of a given word. Their negative-sampling algorithm improved the computational feasibility of training the embeddings. Because of the embeddings' state-of-the-art performance on a number of tasks, there has been much research aimed at better understanding the method. Goldberg and Levy (2014)...
Breaking Sticks and Ambiguities with Adaptive Skip-gram
The recently proposed Skip-gram model is a powerful method for learning high-dimensional word representations that capture rich semantic relationships between words. However, Skip-gram, like most prior work on learning word representations, does not take word ambiguity into account and maintains only a single representation per word. Although a number of Skip-gram modifications were proposed ...
Publication date: 2006